CAGEF_services_slide.png

Introduction to Python for Data Science

Lecture 03: Reading, writing, and wrangling data


0.1.0 About Introduction to Python

Introduction to Python is brought to you by the Centre for the Analysis of Genome Evolution & Function (CAGEF) bioinformatics training initiative. This course was developed based on feedback on the needs and interests of the Department of Cell & Systems Biology and the Department of Ecology and Evolutionary Biology.

The structure of this course is a code-along style; it is 100% hands on! A few hours prior to each lecture, the materials will be available for download on QUERCUS. The teaching materials will consist of a Jupyter Notebook with concepts, comments, instructions, and blank spaces that you will fill out with Python code along with the instructor. Other teaching materials include an HTML version of the notebook and datasets to import into Python when required. This learning approach will let you spend your time coding rather than taking notes!

As we go along, there will be some in-class challenge questions for you to solve either individually or in cooperation with your peers. Post-lecture assessments will also be available (see the syllabus for the grading scheme and percentages of the final mark).

0.1.1 Where is this course headed?

We'll take a blank-slate approach here to Python and assume that you pretty much know nothing about programming. From the beginning of this course to the end, we want to take you from some potential starting scenarios:

and get you to a point where you can:


0.2.0 Lecture objectives

Welcome to this third lecture in a series of six. Today you will dive into more detailed data structures, packages that work with them, and build up towards a more standard data science structure - the DataFrame.

At the end of this lecture we will aim to have covered the following topics:

  1. Wide vs. long-format data and the Tidy data philosophy
  2. Importing Excel spreadsheets
  3. Data wrangling
  4. Exploratory data analysis

0.3.0 A legend for text format in Jupyter markdown

grey background - a package, function, code, command or directory. Backticks are also used for in-line code.
italics - an important term or concept or an individual file or folder
bold - heading or a term that is being defined
blue text - named or unnamed hyperlink

... - Within each coding cell this will indicate an area of code that students will need to complete for the code cell to run correctly.

Blue box: A key concept that is being introduced
Yellow box: Risk or caution
Green boxes: Recommended reads and resources to learn Python

0.4.0 Data used in this lesson

Today's datasets will be used to practice importing Excel spreadsheets and CSV files, and wrangling data with Pandas.

0.4.1 Dataset 1: miscellaneous.xlsx

An Excel book of five sheets, where each sheet is a dataset. We will use this dataset to demonstrate how to import Excel files into Python using Pandas.

0.4.2 Dataset 2: The Human Microbiome Project

This is a dataset resulting from phase 1 of the Human Microbiome Project (HMP). The HMP was a joint effort led by the U.S. National Institutes of Health to characterize the microbiome of healthy human subjects at five major body sites using metagenomics (A.K.A. environmental DNA) and 16S ribosomal RNA gene amplicon sequencing. The dataset that we will use contains the results of 16S rRNA gene sequencing, a technique where this gene is used as a marker for taxonomic identification and classification.

In very general terms, the microbial DNA is extracted from a sample, the 16S rRNA gene is amplified using polymerase chain reaction (PCR), the gene amplicons are sequenced, and the resulting sequences are matched against a reference database. The results are presented as counts of Operational Taxonomic Units (OTUs), which, for data interpretation, are roughly equivalent to microbial species.

There are two datasets located in the data folder which we're interested in:

These files, however, are excessively large - especially the metadata file. Therefore we'll be using a subset of those files called:

0.4.3 Dataset 3: subset_taxa_metadata_merged.csv

This is a copy of our formatted and merged data. It's a smaller subset of the final merged dataset, used for doing some simple exploratory data analysis towards the end of class.


0.5.0 Packages used in this lesson

IPython and InteractiveShell will be accessed just to set the behaviour we want for IPython so we can see multiple code outputs per code cell.

random is a package with methods to add pseudorandomness to programs

numpy provides a number of mathematical functions as well as the special data class of arrays which we'll be learning about today.

The os, pandas, and matplotlib packages will also be used throughout this lesson.


1.0.0 The long and the wide of it - formatting your data for analysis

1.0.1 Defining our data terms

1.1.0 Wide versus long format

Wide and long (sometimes called un-stacked and stacked, wide and tall, or wide and narrow) are terms used to describe how a dataset is formatted.

In a long-formatted dataset, each column is a variable and the results of each measured variable are stored in rows (observations). In contrast, in a wide-formatted dataset not every column is necessarily a variable, so you can have several observations of the same type of variable in the same row. The names long and wide come from the general shapes the two data formats take.

For data science applications, long format is preferred over wide format because it allows for easier and more efficient computations, data subsetting and manipulation. Wide format is more friendly to the human eye and easier to work with when data needs to be manually recorded/input. Therefore, having the ability to interconvert between these two data formats is a valuable and required skill. The following is a general scheme of wide- (left) and long-format (right) datasets:

wide_and_long_formats.png

To read more about wide and long formats, visit https://eagereyes.org/basics/spreadsheet-thinking-vs-database-thinking.
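A minimal sketch of the two formats using pandas (the subjects, timepoints, and values here are made up for illustration):

```python
import pandas as pd

# A wide-format table: one row per subject, one column per timepoint
wide = pd.DataFrame({
    "subject": ["A", "B"],
    "day_1": [5.0, 7.0],
    "day_2": [6.0, 8.0],
})

# The same data in long format: one row per subject/timepoint observation
long = wide.melt(id_vars="subject", var_name="timepoint", value_name="measurement")
print(long)
```

Notice that the long version has one observation per row, which is the shape data science tools generally expect.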


1.2.0 The tidy data philosophy

Why tidy data?

Data cleaning (or dealing with 'messy' data) accounts for a huge chunk of data scientist's time. Ultimately, we want to get our data into a 'tidy' format (long format) where it is easy to manipulate, model and visualize. Having a consistent data structure and tools that work with that data structure can help this process along.

Tidy data has:

  1. Each variable forms a column.
  2. Each observation forms a row.
  3. Each type of observational unit forms a table.

This seems pretty straightforward, and it is. It is the datasets you get that will not be straightforward. Having a map of where to take your data helps you unravel its structure and get it into a usable format.

The 5 most common problems with messy datasets are:

Fortunately there are some tools available to solve these problems.


1.3.0 Reading in data

1.3.1 Looking closer at our data

In our case, we have two datasets that we've already imported with our packages:

1.3.1.1 hmp_metadata

1.3.1.2 hmp_taxa

1.3.2 Our data-wrangling goal

We will use Pandas to merge the two files into a single dataset and will introduce concepts of data wrangling along the way.

These datasets are part of the R package HMP16SData. Details on how to prepare the data for download can be found in the HMP16SData vignette. To read more about the HMP project, visit https://www.hmpdacc.org/.


2.0.0 Working with Excel spreadsheets

Given the popularity of Excel, it is common to find ourselves working with Excel workbooks. Before importing a file, we need to know our current working directory (where on the computer are we working from?), change the directory if necessary, and list the files available for import. For those purposes we use the built-in os library.

2.1.0 The os package provides access to the operating system

In order to access functions and files in your operating system, the os package provides utilities that facilitate navigating through directories and accessing system information. Here's a table with some helpful functions and their description.

Function Description
getcwd() Retrieve the current working directory. This is the location where Python thinks you're working from. This is usually the directory holding your Jupyter notebook.
chdir() Change the current working directory. This can be an absolute path (ie C:/Users/yourname/) or relative to where you currently are (ie data/)
listdir() List the files and folders in the current directory (default) or in the directory specified as a parameter.
mkdir() Make a new directory (aka folder) in the current directory (default) or by providing a relative or absolute path. Note: you will get an error if attempting to make a directory that already exists.
rename() Rename a folder or file. You can even move things to new directories as long as the path exists.
rmdir() Remove a directory; gives an error if the directory does not exist (or is not empty).
remove() Remove a file (path) as long as it exists.
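The functions in the table can be sketched as follows (the directory names here are made up for illustration):

```python
import os

cwd = os.getcwd()          # where Python is currently working from
entries = os.listdir(cwd)  # files and folders in that directory

os.makedirs("demo_dir", exist_ok=True)  # like os.mkdir(), but no error if it exists
os.rename("demo_dir", "renamed_dir")    # rename (or move) a folder
os.rmdir("renamed_dir")                 # remove it again
```

If you try `os.rmdir()` on a directory that was already removed, you get a `FileNotFoundError`, which is exactly the error the table warns about.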

2.2.0 Use pandas read_excel() to import your data

So we covered an important series of steps that are helpful in the conversion of our data. Sometimes data might have other issues, which is all part of data wrangling. To simplify our process, however, the pandas package has already created a function to deal with headers and other issues with importing .xlsx files and yes, it runs openpyxl under the hood.

By default read_excel() will use the first sheet in the workbook, but we can also use the sheet_name parameter with the actual sheet name, or its sheet position.
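A sketch of the three ways to pick a sheet. Since we don't have miscellaneous.xlsx here, the code first writes a small two-sheet demo workbook (the file and sheet names are made up):

```python
import pandas as pd

# Write a small two-sheet demo workbook as a stand-in for miscellaneous.xlsx
with pd.ExcelWriter("demo.xlsx") as writer:
    pd.DataFrame({"x": [1, 2]}).to_excel(writer, sheet_name="first", index=False)
    pd.DataFrame({"y": [3, 4]}).to_excel(writer, sheet_name="second", index=False)

df_default = pd.read_excel("demo.xlsx")                       # first sheet, by default
df_by_name = pd.read_excel("demo.xlsx", sheet_name="second")  # by sheet name
df_by_pos = pd.read_excel("demo.xlsx", sheet_name=1)          # by sheet position (0-indexed)
```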


2.2.1 Use pandas ExcelFile() class to get a list of sheet names

Although we can read in sheets by position, unless we know the sheet names a priori it is hard to call on a specific sheet by name. We already saw that this can be done quite easily with the openpyxl package, but we can also initialize an ExcelFile object, which has the .sheet_names property.

In fact, over the years, the function ExcelFile.parse() has become virtually identical to the read_excel() function if we were to take a look under the hood.

You can also select some specific sheets to read in
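A sketch of ExcelFile in action, again using a freshly written demo workbook (file and sheet names are made up):

```python
import pandas as pd

with pd.ExcelWriter("demo_sheets.xlsx") as writer:
    pd.DataFrame({"a": [1]}).to_excel(writer, sheet_name="alpha", index=False)
    pd.DataFrame({"b": [2]}).to_excel(writer, sheet_name="beta", index=False)

xl = pd.ExcelFile("demo_sheets.xlsx")
print(xl.sheet_names)      # the list of sheet names in the workbook

# parse() behaves essentially like read_excel(..., sheet_name="alpha")
alpha = xl.parse("alpha")
```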


2.2.2 Read multiple sheets into a dictionary of DataFrames

Suppose you wanted to select multiple sheets. The read_excel() function will parse through a list [] of sheet indices and names to store them as entries in a dictionary object using the sheet_name arguments as keys. With this parameter you can:

  1. Mix indices and sheet names in your list!
  2. Use None to open all sheets in your file
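A sketch of both options, using a demo workbook written on the spot (file and sheet names are made up):

```python
import pandas as pd

with pd.ExcelWriter("demo_multi.xlsx") as writer:
    for name in ["one", "two", "three"]:
        pd.DataFrame({"v": [1, 2]}).to_excel(writer, sheet_name=name, index=False)

# Mixing an index and a sheet name: the dict is keyed by exactly what we asked for
some = pd.read_excel("demo_multi.xlsx", sheet_name=[0, "three"])

# None opens every sheet; the dict is keyed by sheet name
everything = pd.read_excel("demo_multi.xlsx", sheet_name=None)
```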

3.0.0 Importing comma-separated values (CSV) files

Next, we will import the HMP files we discussed at the top of class and reformat them in such a way that we can join the two files into a single, long-format Pandas data frame. The image below is a screenshot of the two files that make up the HMP dataset: one file with metadata and OTU counts, and one with the OTU-taxa information. How would you do the joining? What could we use as a key to join the two datasets? (Think of this question as "what do the two datasets have in common?".)

Importing and exploring your dataset, and designing a data wrangling strategy, are things you should always do before you start reshaping your dataset.

Use the next 10 minutes to explore the files using Excel and to propose a data wrangling strategy or even start reshaping your data. Avoid scrolling down for now as the answers are in this notebook. Also, make sure to make copies of the two files and use those for your exploration. (Data inspection is also done in Python, of course, but for the sake of this exercise, use Excel.)


3.0.1 A general strategy for data wrangling

Here is a general overview of how to convert the files into long format and join them into a single data frame:

Metadata file: hmp_metadata

  1. Import and explore the file
  2. Create a subset: This is a very large file and doing wrangling on the entire dataset will be very time consuming and computationally intensive. Creating a subset will make our job more efficient. Once we get the subset properly formatted, we will apply all the changes to the full hmp_metadata.
  3. Remove unnamed column 0: This column was the result of writing the dataset without indicating that we did not want the row indices to be added to the dataset. We do not need those.
  4. Convert the dataset into long format by moving all OTUs into a single column. The process of converting a dataset from wide to long format is generally known as "melting".

OTU-taxa file: hmp_taxa

  1. Import and explore the file.
  2. Take a subset and rename column 0 into something meaningful.
  3. Remove row 0 (all taxonomic ranks in single "cells").
  4. Move OTUs into a single column and then use taxonomic ranks as column names. This will require going from wide to long, and then an extra step to spread the dataset.

Objective

To merge metadata and OTU-taxa into a single file:

merged_dataset.png


3.1.0 Data wrangling with Pandas

Data preprocessing, a.k.a. data wrangling or data munging, is the phase in which the data is prepared for downstream data analysis. It involves data cleaning, formatting, handling of special characters in datasets, etc. Preprocessing datasets can be labor intensive depending on the initial state of the data, as we will see today, but Pandas simplifies this process to a great extent. "Pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive." (https://pandas.pydata.org/docs/getting_started/overview.html).

Let's get down to it. First, import Pandas and Numpy

Now we are all set up to tackle file 1: human_microbiome_project_metadata_subset.csv. This file contains the OTU counts and metadata (data about the data).


3.1.1 Missing values are replaced with NaN

Taking a cursory glance at the DataFrame, we can see the Unnamed column we mentioned in our strategy, but we also see that the 8th column, SRS_SAMPLE_ID, has a number of NaN values. It appears that some information is missing in that column, so those entries were imported with a value of NaN. Perhaps some samples were never submitted to the NCBI database, or there is some other reason; either way, we will take no action.

We should note that by default read_csv() and similar commands will recognize not only empty space ('') but other potential variants of empty values like NA, N/A, etc., and give these the NaN value. Furthermore, you can set additional values to be recognized as NaN using the na_values parameter, as in our command above. In our case, we set a single space ' ' as a recognizable missing value. You can even set a dictionary of missing value characters specific to each column. We'll talk more about missing data soon.
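A sketch of na_values at work, using a tiny made-up CSV held in a string (the sample IDs are invented for illustration):

```python
import pandas as pd
from io import StringIO

# One empty field, one 'NA', and one lone space in SRS_SAMPLE_ID
csv_text = "sample,SRS_SAMPLE_ID\ns1,SRS123\ns2,\ns3,NA\ns4, \n"

# '' and 'NA' become NaN by default; na_values adds ' ' to the recognized set
df = pd.read_csv(StringIO(csv_text), na_values=[" "])
print(df["SRS_SAMPLE_ID"].isna().sum())   # 3
```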

Read more: Find more information about the import functions in pandas here

For now, let's take a closer look at the SRS_SAMPLE_ID column. Recall we can index columns by name (.loc[:, colName]) or by position (.iloc[:, position]).


3.2.0 Data(Frame) inspection using methods and properties

As mentioned before, inspecting your data to understand aspects such as types, missing values, and shape, is a crucial part of data preprocessing. Recall that methods and properties like info(), head(), and .shape, are valuable starting points.


3.2.1 Check for NaN values with the isna() and any() methods

Although we quickly found NaN values in one of our columns using just a cursory inspection, there may be other NaN values imported that we haven't seen or expected. We can check for more NaN values with the isna() DataFrame method, which will return a same-sized object where any NaN or similar values are mapped to the value True.

To accomplish this task we'll also apply the DataFrame.any() method in-line with our isna(). The any() function is a core Python function that checks if any of the elements within an iterable (such as a list) have the value True. In the case of the DataFrame class, the any() method will evaluate along the specified axis. The any() method parameters include:

We'll also use the .columns property for the first time which will return an Index object containing all of the column names.

So, how many columns contain NaN?
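The isna()/any() combination can be sketched with a made-up DataFrame (the column names and values here are invented):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"a": [1, 2], "b": [np.nan, 3], "c": [4, np.nan]})

mask = df.isna()           # same-sized DataFrame of booleans
has_nan = df.isna().any()  # per-column Series: True if the column has any NaN
print(df.columns[has_nan]) # just the columns containing NaN
```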


3.2.2 Use the tolist() method to convert your objects

From above we see that our code creates an Index object, which is part of the pandas package. It has some extra properties and methods that might be useful, but we can also simply convert it to a plain Python list if all we care about is the column names. To accomplish this we'll use the tolist() method, which will iterate through the object and produce a simple list of Python scalars, or pandas scalars for more complex values like Timestamps or intervals.

Food for thought: We could also use the list() function BUT things can get more complicated when working with nested lists or nested structures. The list() function was built as part of core Python and doesn't necessarily have detailed behaviours for handling objects from optional packages like pandas. Therefore, it is best to use the methods and functions provided (when available) by a package to coerce or re-cast objects.
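A quick sketch of the conversion (the DataFrame here is made up):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"a": [1, np.nan], "b": [2, 3]})

idx = df.columns             # a pandas Index object
names = df.columns.tolist()  # a plain Python list of column names
```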

3.2.3 .loc[] and .iloc[] can accept boolean lists as well

So we have a way to get a list of columns with NaN values. In fact, we can use similar methods to retrieve column names for any columns that meet conditional criteria we are interested in. Suppose we wanted to cull a DataFrame down to just the information we are explicitly interested in.

While we discussed the uses of .loc[] and .iloc[] in the context of having column names and index positions, both of these methods can accept boolean series/arrays/lists as a means to filter for specific columns. Remember that multi-indexing uses a [row, column] format, so how do we get just our columns with NaN data?
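A sketch of boolean filtering with .loc[] (the DataFrame is made up for illustration):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"a": [1, 2], "b": [np.nan, 3], "c": [4, np.nan]})

# .loc[rows, columns]: ':' keeps all rows; the boolean Series selects columns
with_nan = df.loc[:, df.isna().any()]
print(with_nan.columns.tolist())
```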


3.2.4 Additional value-checking mechanisms

Much like the isna() method, the DataFrame class has additional methods for checking for null value types like:

  1. isnull(): actually just an alias of isna()
  2. notna(): the boolean inverse of isna()

3.2.5 Use method chaining to reduce intermediate variables

What did we just do above there? While we've been encountering some similar examples in this section, we haven't really addressed this coding style. One of the great things about working with an object-oriented language like Python is that we can call methods in tandem, also known as chaining. Since many methods and properties return an object, as long as we know what kind of object was returned, we can stack method calls on the resulting object. This is especially helpful in the pandas package, where core functions are often present as class methods.

From above:

  1. hmp_metadata.isna() returns a DataFrame upon which we call...
  2. DataFrame.any() returns a Series upon which we call...
  3. Series.sum() returns a scalar in this case

Of course you need to know which objects you are dealing with and their methods but it makes a pretty simple line of code!
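The three-step chain from above can be sketched in one line (the DataFrame is made up):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"a": [1, 2], "b": [np.nan, 3], "c": [4, np.nan]})

# DataFrame -> isna() -> DataFrame -> any() -> Series -> sum() -> scalar
n_cols_with_nan = df.isna().any().sum()
print(n_cols_with_nan)   # 2
```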


3.3.0 Subsetting your DataFrame

Now that we've explored our metadata file a little bit and know that there are some missing values in the SRS_SAMPLE_ID column, we can proceed to the next step of our journey: reshaping the data. Recall that we currently have some 2,889 columns representing operational taxonomic units (OTUs) and a quantification of how many of those units are present within a sample.

3.3.1 Before reshaping hmp_metadata take a subset

A recap on the formatting to be done on the metadata file:

  1. Import and explore the file $\checkmark\checkmark$.
  2. Create a subset: This is a very large file and doing wrangling on the entire dataset will be very time consuming and computationally intensive. Creating a subset will make our job more efficient. Once we get the subset properly formatted, we will apply all the changes to the full hmp_metadata.
  3. Remove unnamed column 0: This column was the result of writing the dataset without indicating that we did not want the row indices to be added to the dataset. We do not need those.
  4. Convert the dataset into long format by moving all OTUs into a single column. The process of converting a dataset from wide to long format is generally known as "melting".

As mentioned previously, preprocessing the full dataset directly is a rather inefficient strategy (not so bad if you have access to vast computational power, but still...). There is quite a bit of trial and error involved in finding out what works best on your data, and the larger the dataset, the longer it will take for output to be generated for you to inspect and evaluate.

Instead, we take a representative sample of the full dataset, write code to preprocess it, and then apply the program we wrote to the full dataset. For example, the full hmp_metadata file has 43,149 rows and 2,898 columns, where columns 9 and onwards (0-indexed) are OTUs; four OTU columns will be enough for data wrangling, so we can ignore 2,889 columns for now. The easiest way to subset data would be using .iloc[] or .loc[].


3.3.2 Subset your data randomly with Numpy's random.choice() function

Sure, we could simply take the first few rows and columns, but we can also take a random subset that is representative of our dataset. Taking random subsets is especially helpful if you are running certain tests on data for machine learning or programming/testing new algorithms.

To accomplish our subsetting goals, we will use Numpy's random.choice() function to generate a representative subset of hmp_metadata. Of course we'll still want those first 9 columns - it's the OTU data that we want to randomize.

Random seed (pseudo-random number generator): A function used to ensure reproducibility when a program involves randomness. It is used in processes such as random number generation or sampling from a dataset.

Now that we know more about random.choice, let's use it. Given that the a argument expects a 1-D array or an integer (see random.choice's documentation), we will use the shape attribute.

Done. Subsampling from both rows and columns seems to be working as expected, so now we can create a subset. How do we use those two lines of code to sample from hmp_metadata without ending up with booleans for a final output?
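A sketch of seeded random column sampling with np.random.choice (the seed and sample size are arbitrary choices; 2,898 is the column count mentioned above):

```python
import numpy as np

np.random.seed(42)   # fixing the seed makes the "random" draw reproducible

# Sample 5 distinct column positions from the 2,898 available
n_cols = 2898
picked = np.random.choice(n_cols, size=5, replace=False)
print(picked)
```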


3.3.3 Make a range of numbers with np.arange()

Not quite what we wanted... We did not catch the first nine columns and we need those. On a side note, seeing only OTU columns is simply a matter of probability as there are thousands of OTU columns versus 9 non-OTU columns. Remember we are merely giving .iloc[] a list of index positions. All we have to do is provide a static and random portion and put them together.

To achieve this we'll use the NumPy arange() function. No, that's not a typo. Python 3 already has a range() function that returns a lazy, generator-like range object (remember those?) for the range specified. NumPy's version returns an array for the range supplied, from start (inclusive) to end (exclusive).

To put two arrays together, we use the np.append() function. Let's give it a try.
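The static-plus-random idea can be sketched like this, using the column counts quoted above (the seed and subset size are arbitrary):

```python
import numpy as np

np.random.seed(0)
static_part = np.arange(0, 9)   # columns 0-8: the metadata columns we always keep

# 4 random OTU columns drawn from positions 9..2897
random_part = np.random.choice(np.arange(9, 2898), size=4, replace=False)

# One combined array of positions, ready to hand to .iloc[]
col_idx = np.append(static_part, random_part)
```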


3.3.4 Use the DataFrame sample() method as an alternative to np.random.choice()

An alternative to random.choice() for random subsampling of either rows or columns is the sample() method. Using the parameters you can decide on:
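A sketch of sample() with a made-up DataFrame (the n, frac, and random_state values are arbitrary):

```python
import pandas as pd

df = pd.DataFrame({"a": range(10), "b": range(10, 20)})

rows = df.sample(n=4, random_state=1)          # 4 random rows
frac = df.sample(frac=0.5, random_state=1)     # half of the rows
cols = df.sample(n=1, axis=1, random_state=1)  # sample columns instead of rows
```

random_state plays the same reproducibility role here that the seed played for np.random.choice().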


3.4.0 Generate summary information for our data subset

Before we continue let's look deeper at our data subset to see what our data is like. We could have done this with our larger dataset but it would likely be overwhelming. We can use properties like .dtypes to find out the data types for our columns, and the method describe() to produce summary statistics for those columns too.


3.4.1 The describe() method works for numeric information

From above, we see that our summary statistics do not include the SEX, RUN_CENTER, HMP_BODY_SITE, HMP_BODY_SUBSITE, and SRS_SAMPLE_ID columns. Given that these are character-based, the describe() method cannot generate any summary statistics for these columns.

If, however, we explicitly provide non-numeric column(s) to describe() we are returned a truncated or altered version of the output. You can provide a single or multiple character columns with similar results.
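A sketch of numeric versus character describe() output, with an invented two-column DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"otu_count": [1, 2, 3],
                   "SEX": ["male", "female", "male"]})

numeric_summary = df.describe()        # numeric columns only (mean, std, quartiles, ...)
text_summary = df[["SEX"]].describe()  # count/unique/top/freq for string columns
full_summary = df.describe(include="all")  # both, with NaN where a stat doesn't apply
```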


3.4.2 Use the unique() method to examine values in your DataFrame

As we can see from above, there are 18 unique HMP_BODY_SUBSITE values. We can see what they are by using the unique() method on a column from our DataFrame. This is a quick way to see the nature of values in your dataset as well. Let's give it a try.
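A sketch with an invented body-site column:

```python
import pandas as pd

df = pd.DataFrame({"HMP_BODY_SITE": ["Oral", "Skin", "Oral", "Stool"]})

# Distinct values, in order of first appearance
sites = df["HMP_BODY_SITE"].unique()
print(sites)
```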


3.4.3 Retrieve your DataFrame as a nested NumPy array with the .values property

Another way to access some or all of the values in your DataFrame is the .values property, which you can also subset with multi- or chained-indexing.
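A sketch of .values with a made-up DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

arr = df.values            # a 2-D NumPy array of the underlying data
first_row = df.values[0]   # chained indexing: the first row as an array
cell = df.values[0, 1]     # multi-indexing: row 0, column 1
```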


3.5.0 Reshaping your data

So we've discussed a few ways to generate subsets from our data. This is helpful when generating code to reshape our data because the original information can be extremely dense and, in some cases, computationally intensive to just "play" with. We've even generated some summary information on our subsets as well, just to help us sanity-check for larger issues we may have overlooked when initially importing the data.

Let's see where we are in our steps:

  1. Import and explore the file $\checkmark\checkmark$.
  2. Create a subset: This is a very large file and doing wrangling on the entire dataset will be very time consuming and computationally intensive. Creating a subset will make our job more efficient. Once we get the subset properly formatted, we will apply all the changes to the full hmp_metadata $\checkmark\checkmark$.
  3. Remove unnamed column 0: This column was the result of writing the dataset without indicating that we did not want the row indices to be added to the dataset. We do not need those.
  4. Convert the dataset into long format by moving all OTUs into a single column. The process of converting a dataset from wide to long format is generally known as "melting".

Up until this point we haven't really altered any of the data we imported. As per our data wrangling plan, however, the next step is to remove column 0 from hmp_metadata. It is just a column with indices starting at 1 that was created when the file was written to disk (saved).

Before we start deleting anything from the dataset, inspect the dataset and pull out a list with column names. The fastest way to retrieve a list of column names is to cast the DataFrame to a list with list().


3.5.1 Remove a column using the drop() method

As usual, there are several ways to achieve the same goal. In this case, we want to remove the first column, unnamed 0, from our DataFrame. We can remove that column by its index using the drop() method from the DataFrame class. This is accomplished in a single call, or one-liner (a nickname for programs that are one line long).

You can complete this code using the index position or the column name itself.


3.5.1.1 Removing by index position can be dangerous

Even though the above code does its job as expected, it is very dangerous. Since it uses index positions to locate the column we wish to remove, once the current column 0 is removed, all other columns shift one position to the left. As a consequence, another column is now at index 0, and if you run this piece of code again, it will keep removing whatever column is at index 0. You can end up removing many more columns than you intended, by mistake, with potentially serious negative implications for your analyses.

Thus, it is safer to use a syntax that explicitly looks for the column of interest, like using the full column name or partial matching (more on that in lecture 5).

If we try to run this same code again, we get a KeyError because there is no column called "Unnamed: 0" anymore.
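A sketch of the safe, name-based drop and the resulting KeyError on a second run (the DataFrame is made up, with the same Unnamed: 0 column name as our data):

```python
import pandas as pd

df = pd.DataFrame({"Unnamed: 0": [1, 2], "SEX": ["male", "female"]})

# Safe: drop by explicit column name (axis=1 means columns)
df = df.drop("Unnamed: 0", axis=1)

# Running the same drop again raises KeyError - the column no longer exists
try:
    df = df.drop("Unnamed: 0", axis=1)
except KeyError:
    print("column already removed")
```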


3.6.0 Melting your variable columns into just two columns

Now that we've removed the Unnamed: 0 column from our DataFrame, let's see where we are in our steps:

  1. Import and explore the file $\checkmark\checkmark$.
  2. Create a subset: This is a very large file and doing wrangling on the entire dataset will be very time consuming and computationally intensive. Creating a subset will make our job more efficient. Once we get the subset properly formatted, we will apply all the changes to the full hmp_metadata $\checkmark\checkmark$.
  3. Remove unnamed column 0: This column was the result of writing the dataset without indicating that we did not want the row indices to be added to the dataset. We do not need those $\checkmark\checkmark$.
  4. Convert the dataset into long format by moving all OTUs into a single column. The process of converting a dataset from wide to long format is generally known as "melting".

3.6.1 Melting your variable columns with the melt() function

We can "melt" all of the OTU variable values into a single column. This essentially means we are going to compress all of the OTU column information into two columns: the first column holds the variable name (i.e. the OTU code), while the second column holds the value. Thus a new observation (row) is added for each variable (column) we are melting. Another important aspect of this step is that the unmelted information (untouched columns) in each observation must be unique to that observation. In our case, columns 0-6 hold the information that will remain untouched. That data will be repeated once for each variable that is converted to an observation.

We'll accomplish this with the Pandas function melt(), or the identical DataFrame method of the same name. The key parameters to keep in mind are:

Unnamed 0 is gone. Next, we will move all OTUs into a single column by "melting" columns 9:hmp_metadata_subset.shape[1] using Pandas' melt().
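The melting step can be sketched with a tiny made-up table of two samples and two OTU columns:

```python
import pandas as pd

df = pd.DataFrame({
    "SAMPLE": ["s1", "s2"],
    "OTU_1": [10, 0],
    "OTU_2": [3, 7],
})

# id_vars stay as-is; every other column melts into variable/value pairs
long = pd.melt(df, id_vars=["SAMPLE"], var_name="OTU", value_name="count")
print(long.shape)   # (4, 3): 2 rows x 2 melted columns
```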


3.6.2 Provide a list of column names to the melt() function

Can you explain why our melted data frame is 500 rows x 10 columns?

If you wanted to keep a subset of identifier columns that were not strictly grouped together you have a few options for creating that list to pass on to the id_vars parameter of melt():

  1. Make an explicit list of column names
  2. Slice and combine subsets of columns from your DataFrame. Just remember any excluded columns will become variable columns unless you explicitly name those as well!
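Both options can be sketched with an invented four-column DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"A": [1], "B": [2], "C": [3], "D": [4]})

# Option 1: an explicit list of identifier column names
ids = ["A", "C"]

# Option 2: slice and combine column subsets from the DataFrame itself
ids_built = df.columns[:1].tolist() + df.columns[2:3].tolist()   # ['A', 'C']

# B and D are excluded from id_vars, so they become variable/value rows
long = df.melt(id_vars=ids_built)
```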

3.7.0 Put your code together now that you've tested it

And... that was the last change to hmp_metadata. Now that we have the file in the shape we wanted, let's preprocess the entire dataset by combining all of the code snippets we've generated.

One file down. One more to go.


3.8.0 Data wrangling hmp_taxa

Recall our OTU-taxa file plan:

  1. Import and explore the file.
  2. Take a subset and rename column 0 into something meaningful.
  3. Remove row 0 (all taxonomic ranks in single "cells").
  4. Move OTUs into a single column and then use taxonomic ranks as column names. This will require going from wide to long, and then an extra step to spread the dataset.

Remember that the ultimate goal is to join the metadata and OTU-taxa datasets. Let's start by inspecting the OTU taxa set.


Check for NaN values within our data.


NaNs are much more abundant in this dataset compared to hmp_metadata. How many columns with NaN values are we dealing with?

So 198 or so columns contain incomplete data, but we'll just leave those for now.


3.8.1 Rename column 0 of hmp_taxa with the rename() method

We've imported and taken a look at our data. Recall our OTU-taxa file plan:

  1. Import and explore the file $\checkmark\checkmark$.
  2. Take a subset and rename column 0 into something meaningful.
  3. Remove row 0 (all taxonomic ranks in single "cells").
  4. Move OTUs into a single column and then use taxonomic ranks as column names. This will require going from wide to long, and then an extra step to spread the dataset.

Renaming columns in a DataFrame is pretty straightforward but does require some thought. Renaming a single column is a simple case, but what happens if we want to rename multiple columns? The pandas package lets us rename multiple DataFrame indices (rows) OR columns with the rename() method, but we should take careful note of the parameters:

First let's take a subset of our data. There are only 7 rows, so we'll keep all of those, but we can trim how many columns we take for now. Afterwards, we'll rename the first column "rank_name" with the rename() method. This modification is optional - helpful, but not indispensable.
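A hedged sketch of that subset-and-rename step on toy data (the column names and lineage strings are illustrative stand-ins for the real file):

```python
import pandas as pd

# Hypothetical miniature stand-in for hmp_taxa; values are illustrative only
hmp_taxa = pd.DataFrame({
    "Unnamed: 0": ["consensus lineage", "SUPERKINGDOM", "PHYLUM"],
    "OTU_97_1": ["k__Bacteria;p__Firmicutes", "Bacteria", "Firmicutes"],
    "OTU_97_2": ["k__Bacteria;p__Proteobacteria", "Bacteria", "Proteobacteria"],
})

# Keep all rows but only the first few columns with iloc, then rename column 0.
# rename() takes a {old_name: new_name} mapping; columns= targets column labels
# (index= would target row labels instead)
hmp_taxa_subset = hmp_taxa.iloc[:, 0:3].rename(columns={"Unnamed: 0": "rank_name"})
print(hmp_taxa_subset.columns.tolist())
```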


3.8.2 Remove row 0 from our DataFrame

So we now have a subset of our data and we've renamed that first column to rank_name. It's time to remove that first row from our dataset.

  1. Import and explore the file $\checkmark\checkmark$.
  2. Take a subset and rename column 0 into something meaningful $\checkmark\checkmark$.
  3. Remove row 0 (all taxonomic ranks in single "cells").
  4. Move OTUs into a single column and then use taxonomic ranks as column names. This will require going from wide to long and then an extra step to spread the dataset.

From hmp_taxa_subset let us remove row 0 (consensus lineage). The OTU-taxa dataset is redundant: row 0 (consensus lineage) contains the same information as rows 1 to 6. This redundancy means that we have two alternatives to convert taxonomic ranks into columns and OTUs into a single column (as we did with hmp_metadata):

  1. Use only row 0, split the contents of each cell based on special characters such as colons and underscores, use the single letters as taxonomic ranks (e.g. c stands for "class"), and then melt and populate the dataset with the contents of the OTU columns. This approach requires the use of regular expressions (regex), and we are not quite there yet. We will revisit this exercise in week 5 (regex) and use this option to format the same dataset.

  2. Remove "consensus lineage" from the dataset and use the remaining six rows and all OTU columns. This alternative is more suitable for our Python programming skills at this point, so that is what we will use.

Remember how we used the drop() method to remove a column from hmp_metadata? Now we can go ahead and use the same method to drop a row as well. By default it will assume you are referring to row indices (axis = 0) when you provide an index as an integer.
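A minimal sketch of the row drop, again on a toy stand-in frame rather than the real dataset:

```python
import pandas as pd

# Toy frame standing in for hmp_taxa_subset (values illustrative)
hmp_taxa_subset = pd.DataFrame({
    "rank_name": ["consensus lineage", "SUPERKINGDOM", "PHYLUM"],
    "OTU_97_1": ["k__Bacteria;p__Firmicutes", "Bacteria", "Firmicutes"],
})

# drop(0) removes the row whose index label is 0 (axis=0 is the default);
# drop("OTU_97_1", axis=1) would remove a column instead
hmp_taxa_subset_no_lineage = hmp_taxa_subset.drop(0)
print(hmp_taxa_subset_no_lineage["rank_name"].tolist())
```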


3.8.4 Melt our OTU taxa information

We've now completed some data reformatting and trimming and we're onto the final step!

  1. Import and explore the file $\checkmark\checkmark$.
  2. Take a subset and rename column 0 into something meaningful $\checkmark\checkmark$.
  3. Remove row 0 (all taxonomic ranks in single "cells") $\checkmark\checkmark$.
  4. Move OTUs into a single column and then use taxonomic ranks as column names. This will require going from wide to long and then an extra step to spread the dataset.

What will our melted DataFrame look like?

| rank_name | OTU | taxonomic_level |
| --- | --- | --- |
| SUPERKINGDOM | OTU_97_1 | Bacteria |
| PHYLUM | OTU_97_1 | Firmicutes |
| CLASS | OTU_97_1 | Bacilli |
| ORDER | OTU_97_1 | Lactobacillales |
| FAMILY | OTU_97_1 | Lactobacillaceae |
| GENUS | OTU_97_1 | Lactobacillus |
| ... | ... | ... |

Challenge

Melt hmp_taxa_subset and create a new dataframe called hmp_taxa_subset_melt.
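One possible solution sketch, shown on a toy stand-in for hmp_taxa_subset (the real frame has many more OTU columns):

```python
import pandas as pd

# Toy stand-in for hmp_taxa_subset after the row-0 drop
hmp_taxa_subset = pd.DataFrame({
    "rank_name": ["SUPERKINGDOM", "PHYLUM"],
    "OTU_97_1": ["Bacteria", "Firmicutes"],
    "OTU_97_2": ["Bacteria", "Proteobacteria"],
})

# id_vars stays fixed; every other column is unpivoted into
# variable/value pairs, renamed here to OTU and taxonomic_level
hmp_taxa_subset_melt = hmp_taxa_subset.melt(
    id_vars="rank_name", var_name="OTU", value_name="taxonomic_level"
)
print(hmp_taxa_subset_melt)
```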


3.8.5 Redistribute your data with the pivot() method

Looking at our melted data, we have now converted the OTU columns into rows. Taking a closer look, however, is the information in a format that we want to work with? When we think of the identifier information from our other table, hmp_metadata, it is hard to imagine how we'll combine the tables (our ultimate goal).

Furthermore, keeping taxonomic ranks as rows will bring difficulties later on when we try to access the data for statistics and plotting. This format also generates six rows per OTU (six taxonomic ranks) instead of a single row per OTU. You can see that what are essentially identifiers of the same "observation" are therefore spread across 6 rows. This doesn't exactly follow our philosophy on tidy data.

Instead, let's convert the information in rank_name to column names, which will convert the redundant OTUs column values into a column of unique values, with the taxonomic_level values filling out the new columns made from rank_name.

To accomplish our conversion goals, we'll use the pivot() method which has the following relevant parameters:

Recall that we can directly call on column names as though they are attributes/properties of the DataFrame. As a sanity check, let's see what our new column names would be.


3.8.5.1 Continuing with the pivot() method...

Challenge

We have 6 unique ranks, and each will become its own column once we "pivot" the melted DataFrame, in addition to the OTU column. Use the Pandas DataFrame pivot() method on the hmp_taxa_subset_melt DataFrame, making sure that the OTU values become the index for the new dataset - this is key and will be used when joining the two datasets.
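A hedged sketch of one way to do this, using a toy melted frame (two ranks and two OTUs instead of the full six and 999):

```python
import pandas as pd

# Toy melted frame in the shape of hmp_taxa_subset_melt
hmp_taxa_subset_melt = pd.DataFrame({
    "rank_name": ["SUPERKINGDOM", "PHYLUM", "SUPERKINGDOM", "PHYLUM"],
    "OTU": ["OTU_97_1", "OTU_97_1", "OTU_97_2", "OTU_97_2"],
    "taxonomic_level": ["Bacteria", "Firmicutes", "Bacteria", "Proteobacteria"],
})

# index= becomes the row labels (the key for the later merge), columns=
# supplies the new column names, and values= fills the cells
hmp_taxa_pivot = hmp_taxa_subset_melt.pivot(
    index="OTU", columns="rank_name", values="taxonomic_level"
)
print(hmp_taxa_pivot)
```

Note that the resulting columns come back in alphabetical order, which is the subject of the next subsection.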


3.8.5.2 The pivot() method has no knowledge of your column order

As we can see from above, we did manage to spread our data out, but the columns have been placed in alphabetical order. While that may not be a huge deal for Python when working with the data in later steps, our human brains betray us. We'd much rather have this information in the taxonomic order that we're used to.

To accomplish this we can reorder our DataFrame based on either an explicitly coded or data-generated list. Recall that using the [] indexing operation with a list will return a DataFrame with the columns in the order specified.
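A minimal sketch of that reordering, on a toy pivoted frame (three ranks instead of six):

```python
import pandas as pd

# Pivoted toy frame; pivot() left the columns in alphabetical order
hmp_taxa_pivot = pd.DataFrame(
    {"CLASS": ["Bacilli"], "PHYLUM": ["Firmicutes"], "SUPERKINGDOM": ["Bacteria"]},
    index=pd.Index(["OTU_97_1"], name="OTU"),
)

# Indexing with a list of column names returns a DataFrame with the
# columns in exactly that order
rank_order = ["SUPERKINGDOM", "PHYLUM", "CLASS"]
hmp_taxa_ordered = hmp_taxa_pivot[rank_order]
print(hmp_taxa_ordered.columns.tolist())
```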

A quick note about the pivot conversion. You can see that the column names are returned in an Index object whose name is "rank_name" and the row indices are returned in an Index object named "OTU". This gives us the output we see when we call on the head() method of our pivoted DataFrame.


3.8.6 Combine our code to format the hmp_taxa dataset

Now we have our dataset in the way we wanted it. Let's apply the code to the entire dataset so we can finally join the two datasets.

Notice that the dataset does not have 999 unique taxonomic assignments. We have 999 rows (one per OTU), but some OTUs share the same taxonomic assignment (this is explained by how OTUs are created using similarities against a consensus sequence). In fact, we have 85 unique genera.

Can you guess why unique() gives us 85 values while drop_duplicates().count() gives us 86 values?
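Off-by-one differences like this usually come down to how each method treats NaN. The toy series below does not reproduce the 85/86 from the real dataset, but it illustrates the kind of NaN bookkeeping involved: unique() keeps NaN as one of its returned values, while count() ignores NaN entirely.

```python
import numpy as np
import pandas as pd

# Toy genus column with one missing assignment
genus = pd.Series(["Lactobacillus", "Streptococcus", np.nan, "Lactobacillus"])

# unique() keeps NaN as one of the returned values...
print(len(genus.unique()))              # 3: two genera plus NaN
# ...while count() (applied after drop_duplicates()) excludes NaN
print(genus.drop_duplicates().count())  # 2
```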


3.9.0 Merge our two datasets hmp_metadata_tidy and hmp_taxa_tidy

We're into the final stretch: now that we have our data in a tidy format, it is time to merge it together. This can also be referred to as joining our data. In order to do so successfully, we need at least one column from each DataFrame that we can use as a reference to combine the data. These columns can also be referred to as keys.

In our case, we will use the unique OTU identifiers as a key to join the two datasets by matching their indices. Letting a program merge the tables is far safer than a copy-paste approach in spreadsheet software, where it is easy to mess up your dataset and ultimately negatively affect your results.

There are several ways to join two datasets with the Pandas package:


3.9.1 Test for values in a DataFrame with the isin() method

Before we proceed with our example, we will subset the datasets in such a way that they both contain the same OTUs so we can get them to match - the same 10 OTUs in each dataset will be enough. The first task is to select 10 OTUs from either file and then use those 10 OTUs to subset both data frames.

The first step of choosing some OTUs from hmp_taxa_tidy is easy. To check for entries that exist in hmp_metadata_tidy, we need to compare the correct column of information and see if each element of that column is contained within our set of OTU values. The isin() method returns booleans, one per element, indicating whether each element of our object is contained within the values we supply.
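A minimal sketch of that check on toy stand-ins (two OTUs instead of ten; names and counts are illustrative):

```python
import pandas as pd

# Toy stand-ins; the real frames hold ~1000 OTUs
hmp_taxa_tidy = pd.DataFrame(
    {"GENUS": ["Lactobacillus", "Streptococcus"]},
    index=pd.Index(["OTU_97_1", "OTU_97_2"], name="OTU"),
)
hmp_metadata_tidy = pd.DataFrame({
    "OTU": ["OTU_97_1", "OTU_97_2", "OTU_97_3"],
    "count": [10, 0, 5],
})

# Pick some OTUs from one frame, then flag matching rows in the other;
# isin() returns a boolean Series we can use as a row mask
chosen_otus = hmp_taxa_tidy.index[:2]
mask = hmp_metadata_tidy["OTU"].isin(chosen_otus)
hmp_metadata_2_otus = hmp_metadata_tidy[mask]
print(hmp_metadata_2_otus["OTU"].tolist())
```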

Both datasets have the same OTUs.


3.9.3 Use the merge() function to combine two DataFrames

Let's take a closer look at the merge() function, which has some of the following parameters:

Now that we've ensured our two subsets hmp_metadata_10_otus and hmp_taxa_10_otus share the same keys, let's combine them. Note that matching keys aren't strictly necessary: each join type in the how parameter deals with merging observations based on keys in a different manner.

| Join type | Description |
| --- | --- |
| left | Keep everything from the left DataFrame and only merge intersecting keys from the right DataFrame. Missing keys are filled with NaNs |
| right | Keep everything from the right DataFrame and only merge intersecting keys from the left DataFrame. Missing keys are filled with NaNs |
| outer | AKA a full outer join: creates the union of the keys and merges based on these. Unmatched cases are filled with NaNs |
| inner | Creates the intersection of the keys from left and right and merges based on these |

The next step is to join the two datasets by a common column or index. If we use a column as the key, it must have the same name in both datasets. If we go with the indices as the key for the merge, we just set merge()'s left_index and right_index arguments to True.

In summary, we either set the hmp_metadata_tidy OTU column as index using the set_index() method OR make hmp_taxa_tidy's index into a column called OTU.

Check it out: Read more about merging here.

Let's merge the two datasets by their index.
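A hedged sketch of an index-based merge, with two-OTU toy stand-ins for the ten-OTU subsets:

```python
import pandas as pd

# Toy subsets sharing the same OTU index (illustrative values)
hmp_metadata_10_otus = pd.DataFrame(
    {"count": [10, 5]},
    index=pd.Index(["OTU_97_1", "OTU_97_2"], name="OTU"),
)
hmp_taxa_10_otus = pd.DataFrame(
    {"GENUS": ["Lactobacillus", "Streptococcus"]},
    index=pd.Index(["OTU_97_1", "OTU_97_2"], name="OTU"),
)

# left_index/right_index tell merge() to match on the indices instead of
# columns; how="inner" keeps only keys present in both frames
merged = pd.merge(hmp_metadata_10_otus, hmp_taxa_10_otus,
                  left_index=True, right_index=True, how="inner")
print(merged)
```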


3.10.0 Apply the final code to the "full" datasets

As simple as that! Now we have a single dataset so we achieved our last goal. Time to apply the code to the full datasets.

Disclaimer: This code will not run on your computer with the large datasets (there is no problem with the alternative subsets); it will crash. Merging the two files results in a file that is 20 GB in size and has 12 million rows. If we started with two files, one 251.2 MB and the other 8.4 MB, how is it possible that we end up with a 20 GB plain-text file?? Ultimately it is the same amount of data... we just reshaped it...


3.10.1 You will always be wrangling data in one way or another

Hopefully you got the hang of what data wrangling (preprocessing) is about and why it often represents 80 to 90% of your data analysis time. The next time you start planning your experiment, it is advisable to also plan ahead how to record your data so you do not have to spend too long formatting it.

As a bonus, if you are mining data or repeating experiments and always getting the same kind of data format, then once you have built these kinds of scripts you can reuse them!


3.11.0 Write your data to file with the to_csv() method

The Pandas DataFrame class has a built-in method for exporting our data to various file formats, like the comma- or tab-separated value formats. We can actually use the same to_csv() method for both file types because we can alter some of the parameters during the call:
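For instance, a minimal sketch of writing tab-separated output (here into an in-memory buffer so the result is visible; in practice you would pass a file path instead):

```python
from io import StringIO

import pandas as pd

df = pd.DataFrame({"OTU": ["OTU_97_1"], "count": [10]})

# sep= switches the delimiter (the default is ","); index=False skips
# writing the row index; a file path would normally go where the buffer is
buffer = StringIO()
df.to_csv(buffer, sep="\t", index=False)
print(buffer.getvalue())
```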

Read more: You can go through more parameters by checking out the ?pd.DataFrame.to_csv command or going to the pandas documentation here.

3.11.1 Write to an excel file with the to_excel() method

Similar in idea to the to_csv() method, this will allow us to directly write our DataFrame objects as Microsoft Excel files. We just need to set the proper parameters:

Read more: You can go through more excel-writing parameters by checking out the ?pd.DataFrame.to_excel command or going to the pandas documentation here.

4.0.0 A quick introduction to exploratory data analysis

Now we are going to do some exploratory data analysis (EDA) on subset_taxa_metadata_merged.csv, a subset of the merged dataset (0.001%). It is always good practice to explore your data to find out more about aspects such as probability distributions, outliers, and measures of central tendency and dispersion.


4.1.0 Get a summary of our information

So we've imported a dataset of ~12,500 rows across 16 columns. You'll note a new column called "OTU" that we didn't have in our prior dataset. Previously we had left this as an index when we were completing our merges, but now it exists as its own column. Recall that at the time of importing we could have used the index_col parameter to bring OTU in as the index, as we had it before.

Time to pull up a summary of the numeric information. Remember that we can use the describe() method to accomplish this.
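A minimal sketch on a toy frame (describe() restricts itself to the numeric columns by default):

```python
import pandas as pd

# Toy merged subset; only the numeric count column shows up in describe()
merged_subset = pd.DataFrame({
    "OTU": ["OTU_97_1", "OTU_97_2", "OTU_97_3"],
    "count": [0, 10, 20],
})

# describe() reports count, mean, std, min, quartiles, and max
summary = merged_subset.describe()
print(summary)
```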


4.2.0 Look through our count data for NaN values

Next, some histograms to check the count distribution of our variables. Let's start by examining the count column for values that are not NaN. We'll use the notna() method to return a same-shaped boolean Series.
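A minimal sketch of that filter on a toy count series:

```python
import numpy as np
import pandas as pd

counts = pd.Series([0, 10, np.nan, 20])

# notna() returns a boolean Series the same shape as its input,
# True wherever a value is present; use it as a mask to keep those rows
mask = counts.notna()
print(counts[mask].tolist())
```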


4.3.0 Use the matplotlib.pyplot.hist() function to make a histogram

We can make a pretty straightforward histogram from our data using the hist() function from the matplotlib package. We'll use all of the default parameters and simply give it an array (in this case a Series object) of numbers to work with. There are a number of parameters you can set, including how it bins the data.
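A minimal sketch on toy numbers (the Agg backend is used here only so the plot renders without a display; in the notebook you would not need it):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend: render without a display window
import matplotlib.pyplot as plt
import numpy as np

values = np.array([1, 1, 2, 3, 3, 3, 10])

# hist() returns the per-bin counts, the bin edges, and the drawn patches;
# bins= controls how the data are split (the default is 10 equal-width bins)
counts, bin_edges, patches = plt.hist(values, bins=5)
print(counts)  # counts per bin; they sum to len(values)
```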

Check it out: Read more about making histograms.

4.4.0 Use the Pandas cut() function to convert your Series into binned data

Much like the histogram above, we can create our own binned data using the cut() function, which will accept a Series or 1-D array as input. Some helpful parameters are:

We can convert the returned output into a new column for our data, to further categorize it and then use the DataFrame hist() method to make a histogram. This actually calls on the same matplotlib hist() function we just used.
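A minimal sketch of cut() on toy counts; note the left-exclusive intervals, which matter in the next subsection:

```python
import pandas as pd

counts = pd.Series([0, 5, 50, 150, 900])

# cut() assigns each value to an interval; intervals are left-exclusive by
# default, so the 0 falls outside (0, 10] and becomes NaN
binned = pd.cut(counts, bins=[0, 10, 100, 1000])
print(binned.value_counts().sort_index())
```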


4.4.1 How you bin your data makes a difference

Notice above that, because our call to cut() is left-exclusive, we drop all of the 0-value data from our count column. Compare the two histograms we've created. When we look at the dataset with our new binning, the microbial (OTU) counts are mostly around the 100 mark.

Is this an accurate representation of the data distribution? Perhaps another kind of visualization would be appropriate to the message we are trying to convey! Something to discuss in another class altogether...

4.4.2 Use the sort_values() method to help order your data

How do we select for values within our merged_subset DataFrame? Let's play around with our data by identifying genera with counts greater than 0 where genus is not NaN. We'll also sort the values in descending order.

To sort values within a DataFrame we use the sort_values() method whose parameters include:
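A minimal sketch combining the filter and the sort on a toy frame (genus names and counts are illustrative):

```python
import numpy as np
import pandas as pd

merged_subset = pd.DataFrame({
    "GENUS": ["Lactobacillus", "Streptococcus", np.nan, "Prevotella"],
    "count": [10, 0, 7, 25],
})

# Keep rows where count > 0 AND GENUS is present, then sort descending;
# by= names the sort column, ascending=False flips the default order
genera_non_0 = (
    merged_subset[(merged_subset["count"] > 0) & merged_subset["GENUS"].notna()]
    .sort_values(by="count", ascending=False)
)
print(genera_non_0["GENUS"].tolist())
```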


4.4.3 Group your data with the groupby() method

Look at our output. For the most part it looks correct, as we now have far fewer rows, BUT a closer inspection shows us an error. We didn't achieve a true count of each genus because the OTU classifications got in the way. What we want to do is group the rows by one or more columns so we can summarize the count values for each grouping.

To accomplish this we'll use the groupby() method which can group our DataFrame with some of the following parameters:

We'll also use the sum() method in our chaining but it will be on a pandas.core.groupby.GroupBy object. If you look this up, you'll see that by default it calculates sums by group in each numeric column.
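A minimal sketch of the groupby-then-sum pattern on toy data, where the same genus is split across two OTU classifications:

```python
import pandas as pd

# Toy data: the same genus appears under multiple OTU classifications
merged_subset = pd.DataFrame({
    "GENUS": ["Lactobacillus", "Lactobacillus", "Prevotella"],
    "OTU": ["OTU_97_1", "OTU_97_5", "OTU_97_2"],
    "count": [10, 4, 7],
})

# groupby() collects rows sharing a GENUS value; sum() then totals the
# numeric count column within each group
genus_totals = merged_subset.groupby("GENUS")["count"].sum()
print(genus_totals)
```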

Let's turn genera_non_0_andabove into a barplot


5.0.0 Class summary

That's our third class on Python! You've made it through, and we've learned about working with data files:

  1. The tidy data philosophy.
  2. Reading in .xlsx and .csv file formats.
  3. Strategies for data wrangling - subsets, removing data, merging tables.
  4. Exploratory data analysis through summaries and data grouping.

5.1.0 Post-lecture assessment (12% of final grade)

Soon after the end of each lecture, a homework assignment will be available for you in DataCamp. Your assignment is to complete chapters 1-3 of Data Manipulation with pandas (Transforming Data, 950 possible points; Aggregating Data, 1300 possible points; and Slicing and Indexing, 1150 possible points). For optional practice, do chapter 4! It covers more on exploratory data analysis and a little bit of data visualization. This is a pass-fail assignment, and in order to pass you need to achieve at least 2550 points (75% of the total possible points). Note that when you take hints from a DataCamp chapter, it will reduce your total earned points for that chapter.

In order to properly assess your progress on DataCamp, at the end of each chapter, please take a screenshot of the summary. You'll see this under the "Course Outline" menubar seen at the top of the page for each course. It should look something like this where it shows the total points earned in each chapter:

DataCamp.example.png

Submit the file(s) for the homework to the assignment section of Quercus. This allows us to keep track of your progress while also producing a standardized way for you to check on your assignment "grades" throughout the course.

You will have until 13:59 hours on Thursday, July 15th, just before the start of that week's lecture, to submit your assignment.


6.0.0 Appendix

6.1.0 Resources


6.2.0 Acknowledgements

Revision 1.0.0: materials prepared by Oscar Montoya, M.Sc. Bioinformatician, Education and Outreach, CAGEF.

Revision 1.1.0: edited and prepared for CSB1021H S LEC0140, 06-2021 by Calvin Mok, Ph.D. Bioinformatician, Education and Outreach, CAGEF.


6.3.0 Use the openpyxl package to open excel files

In earlier years, the xlrd package could be used to read Excel spreadsheet files in the .xls format. However, over the past decade Microsoft has focused on the .xlsx format, which xlrd once supported as well; as of December 2020, xlrd officially dropped support for anything but the .xls format.

Now the officially recommended package for opening the more complex Excel files is openpyxl. You can also use this package to make and save Excel spreadsheet files, so it's a good one to keep in your toolbox.

Let's start by importing the file miscellaneous.xlsx from the data directory by calling on the load_workbook() function.


6.3.1 Look at sheet names with .sheetnames property

Now that we've generated this Workbook object, we can look at some of its properties, like the names of the various sheets via the .sheetnames property. Remember this object holds data in a complex hierarchy, so we'll have to break the information down into something like a DataFrame object.

From there we can access individual sheets using the ["sheetname"] indexing operators.


6.3.2 Access the values of a worksheet with the .values property

Now that we have pulled out a Worksheet object, we need a way to access the values stored in its cells. That is what the .values property is for. We'll see, however, that it's not exactly the data structure we might imagine.


6.3.3 Iterate through a generator with the next command

As you can see from above, we managed to convert our worksheet data into a DataFrame but it wasn't exactly a perfect conversion. Looking at the original object we get from the .values property, we get a generator object in return.

A generator is a special kind of function that works its way through an iterable object (like a list or dictionary) and can return a sequence of values rather than just a single value. It also pauses ("yields") its execution wherever it is in the iterable object and hands that state back to you in a generator object. Sounds confusing, right?

Simply put, this kind of function serves as an iterator for moving through the rows of our sheet value data one at a time using the next() command. This is helpful to know because looking closer at our DataFrame we see that the header information was deposited into the first row of our DataFrame rather than as column names. There are a few ways to remedy this but one way is to properly iterate through our generator with the next() command.

By default, the next() command on our Worksheet.values generator will return a row of values. Once it reaches the end of the structure, calling next() will raise a StopIteration error.

Let's take a look ourselves.

Why did we get the StopIteration? When we assigned the values to a DataFrame above, the whole .values generator was consumed, and now there's nothing left to squeeze from it. We'll have to start over again.
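A tiny pure-Python illustration of the same behaviour, without any spreadsheet involved (the generator and its rows are made up for demonstration):

```python
def row_gen():
    # A minimal generator: yields rows one at a time, like Worksheet.values
    yield ("name", "count")
    yield ("OTU_97_1", 10)

rows = row_gen()
header = next(rows)      # first yielded row
first_data = next(rows)  # second yielded row
print(header, first_data)

# The generator is now exhausted; another next() raises StopIteration
try:
    next(rows)
except StopIteration:
    print("exhausted")
```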


6.3.4 The next() function returns a tuple

That's right, our good friend the tuple is back and therefore we can slice it with the [] operators. That means we can pull out specific "columns" that we're interested in from each row.


6.3.5 Use next() to extract a header for a DataFrame

Now that we know the inner workings a little more, we can grab the first row from the Worksheet.values generator and keep that for our columns. Then we can pass the rest of the generator to the initialization of the DataFrame. Let's see that in detail.
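A hedged sketch of the whole pattern, building a tiny workbook in memory rather than loading miscellaneous.xlsx (the sheet contents are invented for the demonstration):

```python
import pandas as pd
from openpyxl import Workbook

# Build a tiny in-memory workbook standing in for a real .xlsx file
wb = Workbook()
ws = wb.active
ws.append(["OTU", "count"])  # header row
ws.append(["OTU_97_1", 10])
ws.append(["OTU_97_2", 5])

# .values is a generator over rows (as tuples); pull off the header row
# with next(), then hand the remaining rows to the DataFrame constructor
values = ws.values
header = next(values)
df = pd.DataFrame(values, columns=header)
print(df)
```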


The Centre for the Analysis of Genome Evolution and Function (CAGEF)

The Centre for the Analysis of Genome Evolution and Function (CAGEF) at the University of Toronto offers comprehensive experimental design, research, and analysis services in microbiome and metagenomic studies, genomics, proteomics, and bioinformatics.

From targeted DNA amplicon sequencing to transcriptomes, whole genomes, and metagenomes, from protein identification to post-translational modification, CAGEF has the tools and knowledge to support your research. Our state-of-the-art facility and experienced research staff provide a broad range of services, including both standard analyses and techniques developed by our team. In particular, we have special expertise in microbial, plant, and environmental systems.

For more information about us and the services we offer, please visit https://www.cagef.utoronto.ca/.

CAGEF_new.png